HMM Revises Low Marginal Probability by CRF for Chinese Word Segmentation

نویسندگان

  • Degen Huang
  • Deqin Tong
  • Yanyan Luo
چکیده

This paper presents a Chinese word segmentation system for CIPS-SIGHAN 2010 Chinese language processing task. Firstly, based on Conditional Random Field (CRF) model, with local features and global features, the character-based tagging model is designed. Secondly, Hidden Markov Models (HMM) is used to revise the substrings with low marginal probability by CRF. Finally, confidence measure is used to regenerate the result and simple rules to deal with the strings within letters and numbers. As is well known that character-based approach has outstanding capability of discovering out-of-vocabulary (OOV) word, but external information of word lost. HMM makes use of word information to increase in-vocabulary (IV) recall. We participate in the simplified Chinese word segmentation both closed and open test on all four corpora, which belong to different domains. Our system achieves better performance.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

HMM and CRF Based Hybrid Model for Chinese Lexical Analysis

This paper presents the Chinese lexical analysis systems developed by Natural Language Processing Laboratory at Dalian University of Technology, which were evaluated in the 4th International Chinese Language Processing Bakeoff. The HMM and CRF hybrid model, which combines character-based model with word-based model in a directed graph, is adopted in system developing. Both the closed and open t...

متن کامل

A Simple and Effective Closed Test for Chinese Word Segmentation Based on Sequence Labeling

In many Chinese text processing tasks, Chinese word segmentation is a vital and required step. Various methods have been proposed to address this problem using machine learning algorithm in previous studies. In order to achieve high performance, many studies used external resources and combined with various machine learning algorithms to help segmentation. The goal of this paper is to construct...

متن کامل

A Chinese Word Segmentation System Based on Structured Support Vector Machine Utilization of Unlabeled Text Corpus

We have participated in the open tracks and closed tracks on four corpora of Chinese word segmentation tasks in CIPSSIGHAN-2010 Bake-offs. In our experiments, we used the Chinese inner phonology information in all tracks. For open tracks, we proposed a double hidden layers’ HMM (DHHMM) in which Chinese inner phonology information was used as one hidden layer and the BIO tags as another hidden l...

متن کامل

An Double Hidden HMM and an CRF for Segmentation Tasks with Pinyin's Finals

In this paper, we present the proposed method of participating SIGHAN-2010 Chinese word segmentation bake-off. In this year, our focus aims to quick train and test the given data. Unlike the most structural learning algorithms, such as conditional random fields, we design an in-house development conditional support vector Markov model (CMM) framework. The method is very quick to train and also ...

متن کامل

Description of the NCU Chinese Word Segmentation and Named Entity Recognition System for SIGHAN Bakeoff 2006

Asian languages are far from most western-style in their non-separate word sequence especially Chinese. The preliminary step of Asian-like language processing is to find the word boundaries between words. In this paper, we present a general purpose model for both Chinese word segmentation and named entity recognition. This model was built on the word sequence classification with probability mod...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2010